Stratified Sampling Design Based on Data Mining
Abstract
OBJECTIVES To explore classification rules, derived with data mining methods, for defining strata in stratified sampling of healthcare providers so as to improve sampling efficiency.
METHODS We performed k-means clustering to group providers with similar characteristics, then constructed decision trees on the cluster labels to generate stratification rules. To evaluate the performance of the sampling design, we compared the variance explained by the stratification proposed in this study with the variance explained by conventional stratification. We constructed a study database from health insurance claims data and providers' profile data made available to this study by the Health Insurance Review and Assessment Service of South Korea, and from population data from Statistics Korea. From this database, we used the 2011 data for single-specialty clinics or hospitals in two specialties, general surgery and ophthalmology.
RESULTS Data mining resulted in five strata in general surgery with two stratification variables (the number of inpatients per specialist and the population density of the provider location) and five strata in ophthalmology with two stratification variables (the number of inpatients per specialist and the number of beds). The percentages of variance in annual changes in the productivity of specialists explained by the stratification in general surgery and ophthalmology were 22% and 8%, respectively, whereas conventional stratification by the type of provider location and number of beds explained 2% and 0.2% of variance, respectively.
CONCLUSIONS This study demonstrated that data mining methods can be used to design efficient stratified sampling with variables readily available to the insurer and government; it offers an alternative to the existing stratification method that is widely used in healthcare provider surveys in South Korea.
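The abstract does not state how "variance explained" was computed. A minimal sketch, assuming the usual one-way ANOVA decomposition (between-strata sum of squares divided by total sum of squares), is given below. The function name `variance_explained` and the toy data are hypothetical and are not taken from the study.

```python
from collections import defaultdict
from statistics import fmean


def variance_explained(values, strata):
    """Share of the total variance in `values` accounted for by `strata`.

    Assumed definition (not confirmed by the abstract): between-strata
    sum of squares divided by total sum of squares.
    """
    if len(values) != len(strata):
        raise ValueError("values and strata must have the same length")
    if not values:
        raise ValueError("values must not be empty")

    grand_mean = fmean(values)
    total_ss = sum((v - grand_mean) ** 2 for v in values)
    if total_ss == 0:
        return 0.0  # no variation, so nothing to explain

    # Collect the outcome values belonging to each stratum.
    groups = defaultdict(list)
    for value, label in zip(values, strata):
        groups[label].append(value)

    # Weight each stratum's squared mean deviation by its size.
    between_ss = sum(
        len(g) * (fmean(g) - grand_mean) ** 2 for g in groups.values()
    )
    return between_ss / total_ss


# Hypothetical toy data: an outcome and two competing stratifications.
productivity_change = [1.0, 2.0, 3.0, 11.0, 12.0, 13.0]
data_mined_strata = ["A", "A", "A", "B", "B", "B"]
conventional_strata = ["urban", "rural", "urban", "rural", "urban", "rural"]

share_mined = variance_explained(productivity_change, data_mined_strata)
share_conventional = variance_explained(productivity_change, conventional_strata)
```

Under this definition, a stratification whose strata are internally homogeneous in the outcome scores close to 1, and one that is unrelated to the outcome scores close to 0, which is the sense in which the 22% and 8% figures compare favorably with 2% and 0.2%.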
Similar resources
Stratified and Un-stratified Sampling in Data Mining: Bagging
Stratified sampling is often used in opinion polls to reduce standard errors, and it is known as variance reduction technique in sampling theory. The most common approach of resampling method is based on bootstrapping the dataset with replacement. A main purpose of this work is to investigate extensions of the resampling methods in classification problems, specifically we use decision trees, fr...
Stratified Sampling for Association Rules Mining
It is well recognized that mining association rules in a very large database is usually time consuming due to the I/O overhead in scanning the disk resident database. As one of the techniques for reducing the I/O overhead, sampling for mining association rules has been actively investigated during the last few years. Each sampling method and algorithm proposed in the literature has its own meri...
A stratified sampling technique based on correlation feature selection method for heart disease risk prediction system
In medical, data mining method can be utilized by the physicians to improve clinical diagnosis. In this paper a stratified approach named Correlation Feature Selection Stratified Sampling (CFS-SS) has been introduced. This method is applied to medical diagnosis heart disease risk prediction system. By using this proposed system the attributes are grouped together into homogenous sub groups, bef...
Support Vector Machine based on Stratified Sampling
Support vector machine is a classification algorithm based on statistical learning theory. It has shown many results with good performances in the data mining fields. But there are some problems in the algorithm. One of the problems is its heavy computing cost. So we have been difficult to use the support vector machine in the dynamic and online systems. To overcome this problem we propose to u...
Selected Prior Research
• 1996 scaled tree-based classifiers to very large data sets. A fundamental challenge in data mining is to mine data sets that are so large that they do not fit into a computer’s memory. This is important for a wide variety of applications ranging from homeland defense to identifying fraudulent credit card transactions. One of the most accurate techniques in data mining is tree-based classifier...